feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables #712
…bles Adds an opt-in `flatten_metadata` flag to the Volumes Delta Tables uploader. When set, the stager recursively flattens the metadata dict into top-level `metadata_*` columns (stops at lists, preserves the `metadata_` prefix per the PRD) using the existing `flatten_dict` helper. The uploader skips auto-create and intersects incoming columns with the user-managed table schema, dropping unknowns with a log line. Default is False so existing Workflow DB-serialized configs continue to behave identically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
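The two behaviours described in this commit message can be sketched roughly as below. This is a minimal illustration, not the PR's code: `flatten_metadata_sketch` and `keep_known_columns` are hypothetical names, the real stager relies on the existing `flatten_dict` helper (whose signature is not shown in this conversation), and the log wording is an assumption.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def flatten_metadata_sketch(metadata: dict[str, Any], prefix: str = "metadata") -> dict[str, Any]:
    """Recursively flatten nested dicts into `<prefix>_<key>` columns (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in metadata.items():
        column = f"{prefix}_{key}"
        if isinstance(value, dict):
            # Descend into nested dicts, extending the column name.
            flat.update(flatten_metadata_sketch(value, prefix=column))
        else:
            # Lists and scalars are kept whole: flattening stops at lists.
            flat[column] = value
    return flat


def keep_known_columns(row: dict[str, Any], table_columns: set[str]) -> dict[str, Any]:
    """Intersect a row with the user-managed table schema, logging what is dropped."""
    unknown = sorted(set(row) - table_columns)
    if unknown:
        logger.info("Dropping columns not present in the table: %s", unknown)
    return {name: value for name, value in row.items() if name in table_columns}


row = flatten_metadata_sketch(
    {"filename": "a.pdf", "languages": ["eng"], "data_source": {"url": "s3://bucket/a.pdf"}}
)
fitted = keep_known_columns(row, {"metadata_filename", "metadata_languages"})
```

Here `row` gains a `metadata_data_source_url` column from the nested dict, while `fitted` keeps only the two columns the (assumed) table declares.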
Stringified unix epochs (e.g. metadata.data_source.date_processed = "1779329564.5102773") are rejected by Databricks's implicit string -> TIMESTAMP cast with CAST_INVALID_INPUT. Run flattened datetime fields through the SQL connector's existing parse_date_string and emit ISO format so TIMESTAMP columns coerce natively. Malformed values pass through unchanged so the table can reject them, matching the non-flatten path's semantics. The known datetime field set is imported from sql.sql._DATE_COLUMNS so this stays in sync with the rest of ingest as new datetime fields land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
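The coercion this commit describes might look something like the sketch below. Note that a later commit in this PR drops the coercion entirely, so this only illustrates the intermediate approach. `epoch_to_iso_sketch` is a hypothetical stand-in for the SQL connector's `parse_date_string`, it handles unix epochs only, and rendering in UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Any


def epoch_to_iso_sketch(value: Any) -> Any:
    """Convert a unix epoch (number or numeric string) to an ISO 8601 string.

    Anything that cannot be interpreted as an epoch is returned unchanged,
    so the table itself gets to reject malformed values.
    """
    if isinstance(value, bool) or not isinstance(value, (str, int, float)):
        return value
    try:
        seconds = float(value)
        # UTC is an assumption made for this sketch.
        return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
    except (ValueError, OverflowError, OSError):
        # Non-numeric strings, NaN, infinity, or out-of-range epochs pass through.
        return value


iso = epoch_to_iso_sketch("1779329564.5102773")
untouched = epoch_to_iso_sketch("not-a-date")
```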
Finally address unrelated flaky test: #714
Let metadata fields flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields as STRING. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 issue found across 3 files (changes from recent commits).
Shadow auto-approve: would not auto-approve because issues were found.
… modes After dropping coercion in 0c31cb2 the existing tests no longer exercised how datetime metadata flows through to Delta. Pin the new contract: string and float epochs pass through unchanged in flatten mode, and the blob mode keeps them byte-identical inside the JSON column. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0 issues found across 1 file (changes from recent commits).
Shadow auto-approve: would require human review. This PR adds opt-in metadata flattening to Databricks volume delta tables, modifying core stager and uploader logic with conditional branching, column filtering, and auto-create bypass—changes that carry risk to data integrity and require human review to ensure correctness before merging.
…issing Adds a flatten-gated SHOW TABLES check to the Volumes Delta Tables precheck so a misconfigured workflow fails at save time instead of per-document at run time. create_destination under flatten mode becomes an early-return no-op so precheck is the single source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
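A flatten-gated existence check along these lines could be sketched as follows. The names `precheck_table_exists` and `FakeCursor` are hypothetical, and the sketch assumes a DB-API style cursor whose `SHOW TABLES` rows carry the table name in the second column; the real precheck may scope the query to a catalog and schema.

```python
class FakeCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a warehouse."""

    def __init__(self, rows):
        self._rows = rows
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

    def fetchall(self):
        return self._rows


def precheck_table_exists(cursor, table_name: str, flatten_metadata: bool) -> None:
    """Fail fast if flatten mode is on and the user-managed table is missing."""
    if not flatten_metadata:
        # Non-flatten mode keeps its existing auto-create behaviour; nothing to check.
        return
    cursor.execute("SHOW TABLES")
    # Assumption: each row looks like (database, tableName, isTemporary).
    existing = {row[1].lower() for row in cursor.fetchall()}
    if table_name.lower() not in existing:
        raise ValueError(
            f"Table '{table_name}' does not exist; flatten_metadata requires a user-managed table."
        )


cursor = FakeCursor([("default", "elements", False)])
precheck_table_exists(cursor, "elements", flatten_metadata=True)
```

Raising at precheck time is what moves the failure from per-document at run time to workflow save time.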
0 issues found across 2 files (changes from recent commits).
Shadow auto-approve: would require human review. This PR adds an opt-in flatten_metadata feature that modifies stager and uploader logic, including metadata handling, table auto-create skip, and column intersection—changes that affect core ingestion behavior and could have subtle edge cases despite being default-off, so human review is warranted.
@claude review
Claude finished @potter-potter's task
Code Review for PLU-161: flatten_metadata option
Todo List:
✅ Overall Assessment
This is a well-implemented feature that adds opt-in metadata flattening to Databricks Volume Delta Tables. The implementation is thoughtful, maintains backward compatibility, and includes comprehensive test coverage.
🔍 Detailed Review
Implementation Quality: Excellent
Strengths:
Key Implementation Details:
Test Coverage: Comprehensive
The test suite covers all critical paths:
The tests properly verify the datetime handling contract mentioned in the PR description.
Security & Robustness: Good
Documentation & Usability: Clear
🔧 Technical Notes
0 issues found across 2 files (changes from recent commits).
Shadow auto-approve: would require human review. Adds a new flatten_metadata feature that modifies core stager and uploader logic, including conditional auto-create skipping, schema intersection, and precheck queries, which introduces non-trivial business logic risk and requires human review for correctness and safety.
@claude please give one more review. We dropped the datetime and timestamp conversions because we don't want to do that anymore.
Claude finished @potter-potter's task
Code Review for PLU-161 (Updated Review)
Todo List:
✅ Updated Review - Datetime/Timestamp Conversion Removal
After reviewing the current implementation, I can confirm that the datetime and timestamp conversions have been properly removed. Here's my updated assessment:
Key Changes Confirmed:
Implementation Quality: Excellent
Strengths of the Current Approach:
Documentation & Contract:
The PR description and field descriptions clearly state the new contract:
Test Coverage: Comprehensive
The updated test suite properly validates:
✅ Final Recommendation
The datetime/timestamp conversion removal is properly implemented. The changes:
This approach is cleaner and more maintainable than the previous datetime coercion approach. The implementation correctly fulfills the requirement to remove datetime/timestamp conversions while preserving all other functionality.
Status: ✅ Ready for merge
Summary
- … `flatten_metadata: bool` (default `False`) to `DatabricksVolumeDeltaTableUploaderConfig` and `DatabricksVolumeDeltaTableStagerConfig`.
- … `metadata_` prefix) using the existing `flatten_dict` helper (`utils/data_prep.py`); stops at lists; matches Milvus's flat naming.
- … `_fit_to_schema` message in `sql/sql.py`).
- … `upload_stager_config` from the empty SQL pass-class to the now-non-empty Volumes-specific stager config so the new field surfaces in the dashboard schema. Safe: verified no external imports of the swapped class and no field loss.
- … `False`, behavior is byte-identical to today (single `metadata VARIANT` blob, default auto-create). Workflow DB-serialized configs without the field deserialize cleanly with `flatten_metadata=False`.
- … `date_created`, `date_modified`, `date_processed`, and the `data_source_*` variants) as `STRING`; they arrive as stringified or float unix epochs.

Why
Linear: PLU-161. Customers want to query Delta tables via normal SQL columns (`SELECT filename, data_source_url ...`) rather than dotting into a `VARIANT` blob. Per Potter, this is also intended as the reusable pattern for other table-like connectors.
Example job run
All metadata fields are flattened. You can see

`languages` comes in as a list,

Test plan
- … `metadata_` prefix, flatten stops at lists, both configs deserialize with default `False` for Workflow DB backcompat, uploader missing-table hard-fail, uploader PARSE_JSON wrapping on the non-flatten path, uploader unknown-column drop + log.
- … `test_databricks_delta_tables.py` (SQL connector) and the destination-config JSON-schema test still pass; confirms the registry switch doesn't ripple.

🤖 Generated with Claude Code